Layout-agnostic complex document processing system

ABSTRACT

Techniques for layout-agnostic complex document processing are described. A document processing service can analyze documents that do not adhere to defined layout rules in an automated manner to determine the content and meaning of a variety of types of segments within the documents. The service may chunk a document into multiple chunks, and operate upon the chunks in parallel by identifying segments within each chunk, classifying the segments into segment types, and processing the segments using special-purpose analysis engines adapted for the analysis of particular segment types to generate results that can be aggregated into an overall output for the entire document that captures the meaning and context of the document text.

BACKGROUND

The automated processing and understanding of content within a variety of types of documents or portions of documents, such as from forms or tables within documents, is a difficult problem with a long-felt need for a solution. Although solutions do exist for analyzing documents having fixed, rigidly-defined layouts (because it is trivial to write logic to identify known values in known locations), the automated handling of documents that have flexible and/or unknown layouts is not a solved problem.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 is a diagram illustrating an architectural framework of a system for layout-agnostic complex document processing according to some embodiments.

FIG. 2 is a diagram illustrating an exemplary architecture of an analysis service for layout-agnostic complex document processing according to some embodiments.

FIG. 3 is a diagram illustrating an environment including a document processing service of a provider network for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments.

FIG. 4 is a diagram illustrating an example of training of one or more machine learning models of a key-value differentiation unit according to some embodiments.

FIG. 5 is a diagram illustrating an exemplary grouping loss function useful for training one or more machine learning models of a key-value differentiation unit according to some embodiments.

FIG. 6 is a diagram illustrating exemplary stages for inference performed by a key-value differentiation unit according to some embodiments.

FIG. 7 is a diagram illustrating additional exemplary stages for inference performed by a key-value differentiation unit according to some embodiments.

FIG. 8 is a diagram illustrating an environment for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments.

FIG. 9 is a diagram illustrating an exemplary loss function useful for training one or more machine learning models of an embedding generator according to some embodiments.

FIG. 10 is a diagram illustrating exemplary edge weight determination, graph construction, and graph partitioning operations used for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments.

FIG. 11 is a flow diagram illustrating exemplary operations of a method for layout-agnostic complex document processing according to some embodiments.

FIG. 12 illustrates an example provider network environment according to some embodiments.

FIG. 13 is a block diagram of an example provider network that provides a storage service and a hardware virtualization service to customers according to some embodiments.

FIG. 14 is a block diagram illustrating an example computer system that may be used in some embodiments.

DETAILED DESCRIPTION

Various embodiments of methods, apparatus, systems, and non-transitory computer-readable storage media for layout-agnostic complex document processing are described. According to some embodiments, a document processing service implemented with a variety of microservices can analyze and “understand” the content of digital documents at cloud-scale without needing an advance understanding of the type or layout of the document. The document processing service may implement a dual-path approach to perform the analysis in an “online” synchronous manner responsive to user requests or in an “offline” asynchronous manner, where single documents or batches of documents can be analyzed. The document processing service can detect and comprehend various segments of different types within a document, such as forms, receipts, invoices, tables, etc., and can determine what data is present within these segments and the inter-relationships amongst the data. For example, in some embodiments the document processing service can identify the existence of a form within a digital document, identify which portions of the form are “keys,” which portions of the form are “values” of the keys, and which of the values correspond to which of the keys. Accordingly, the document processing service can provide the results of the analysis back to users and/or enable more complex processing to be performed upon this detected data.

In modern industry, techniques from the field of machine learning (ML) are being adapted in a variety of fields for a variety of purposes. For example, using ML to analyze images to detect the existence and locations of objects is one such application. Another application includes analyzing images (that carry a representation of a document) to identify what text is within the document. For example, to implement a document processing service, numerous types of ML models may be utilized to analyze digital documents seeking to, for example, identify where text is represented within an image representation of a document and identify what that text is.

However, performing such operations with documents carrying a representation of a form, receipt, invoice, or other type of “complex” document—especially when a ML model has not been trained to “understand” a particular format of the document—remains a difficult task. For example, identifying the various components in a document that contains a form is problematic. In some cases, a form schema that contains the various form keys (e.g., “first name”, “last name”, “address”, etc.) is available, but the actual form layout may vary across different forms. In other cases, a form schema may not even exist. Although some optical character recognition (OCR) tools can successfully identify some of the words contained in documents, identifying and classifying these words in the context of forms (and other types of “segments” of a document) is not trivial due to various noises and nuisances, including outlier detections and ambiguity in classification—for example, the word “name” may appear multiple times. Similar issues exist for other types of documents, and these issues are magnified for documents including multiple segment types (e.g., a form, a paragraph of text, a receipt, a column of text, a table, an invoice, etc.).

Prior systems are not able to accommodate these types of documents and/or document segment types. For example, a previous approach to analyzing forms included attempting to extract entered form data by, for example, obtaining an image of a form (e.g., scanning a form) and trying to analyze it based on some defined layout (or “template”) that is defined in advance. Thus, by using advance knowledge of where a particular field is, the system can look in that defined place to find the value—e.g., look in a box at a particular coordinate within a document for a “name.” However, when the layout of the form to be analyzed changes, such systems break until the system is reconfigured with the details of the “new layout” in order to thereafter be able to process that form—for example, people have to update mappings, and the system needs to know which one of the mappings needs to be used for a particular form. This problem is further complicated in that there can be dozens of different layouts for a same form, new forms can be introduced all the time, etc., and thus this approach doesn't scale.

However, embodiments provide an automatic, layout-free system that accommodates various types of documents, optionally including various document segment types. Embodiments no longer necessarily need explicit knowledge of a document (or segment, such as a form) in advance but can instead intelligently analyze the document to detect the type of document, document segments, and data within the document segments (e.g., where the keys and values are located in a form, without the use of a faulty template-matching approach).

FIG. 1 is a diagram illustrating an architectural framework of a system for layout-agnostic complex document processing according to some embodiments. The system includes a set of one or more document processing service application programming interface(s) 105, an asynchronous job manager service 114, a chunk manager service 116, and one or more analysis services 110, any of which may be implemented as software modules executed by one or multiple computing devices (e.g., server computing devices) at one location (e.g., within a rack or set of computing devices in a same room or building) or multiple locations (e.g., within multiple rooms or buildings at one or multiple geographic locations—potentially across numerous cities, states, countries, or continents). These elements of the system may rely upon a set of data structures (e.g., queues, database tables, flat files, etc.)—implemented within the system or provided by other components (e.g., other services)—such as the job table 118, document added queue 120, chunk table 122, document metadata table 124, chunk added queue 126, and/or chunk completed queue 128. However, these data structures may be consolidated (e.g., multiple ones of the structures can be implemented in a joint manner) or further expanded in other embodiments according to techniques known to those of skill in the art.

In some embodiments, this system is implemented as part of a document processing service 112 that itself may be implemented within a provider network 100. A provider network 100 provides users (e.g., user 106) with the ability to utilize one or more of a variety of types of computing-related resources such as compute resources (e.g., executing virtual machine (VM) instances and/or containers, executing batch jobs, executing code without provisioning servers), data/storage resources (e.g., object storage, block-level storage, data archival storage, databases and database tables, etc.), network-related resources (e.g., configuring virtual networks including groups of compute resources, content delivery networks (CDNs), Domain Name Service (DNS)), application resources (e.g., databases, application build/deployment services), access policies or roles, identity policies or roles, machine images, routers and other data processing resources, etc. These and other computing resources may be provided to users as “services,” such as a hardware virtualization service that can execute compute instances, a storage service that can store data objects, etc. The users 106 (or “customers”) of provider networks 100 may utilize one or more user accounts that are associated with a customer account, though these terms may be used somewhat interchangeably depending upon the context of use. Users 106 may, via a client device 104 such as a personal computer (PC), laptop, mobile device, smart device, etc., interact with a provider network 100 across one or more intermediate networks 107 (e.g., the internet) via one or more interface(s) 103, such as through use of application programming interface (API) calls, via a console implemented as a website or application, etc. The interface(s) 103 may be part of, or serve as a front-end to, a control plane 102 of the provider network 100 that includes “backend” services supporting and enabling the services that may be more directly offered to customers.

To provide these and other computing resource services, provider networks 100 often rely upon virtualization techniques. For example, virtualization technologies may be used to provide users the ability to control or utilize compute instances (e.g., a VM using a guest operating system (O/S) that operates using a hypervisor that may or may not further operate on top of an underlying host O/S, a container that may or may not operate in a VM, an instance that can execute on “bare metal” hardware without an underlying hypervisor), where one or multiple compute instances can be implemented using a single electronic device. Thus, a user may directly utilize a compute instance hosted by the provider network to perform a variety of computing tasks, or may indirectly utilize a compute instance by submitting code to be executed by the provider network, which in turn utilizes a compute instance to execute the code (typically without the user having any control of or knowledge of the underlying compute instance(s) involved).

A user 106 via a client device 104 (or a client device 104 acting under the configuration of, or the benefit of, a user 106) may interact with the document processing service 112 to analyze electronic (or “digital”) documents. An electronic document may be a digital file storing a representation of a document, such as a form or receipt. For example, an electronic document may be a Portable Document Format (PDF) file, a word processing document such as a Word™ document or Open Document Format (ODF) file, an image including a representation of a document (e.g., a JPG, GIF, PNG)—which may have been captured by an optical sensor (or camera) of the client device 104, etc. Each electronic document may include one or multiple “pages” of content, and each page may include one or multiple document “segments” described above, where each may be thought of as a portion of a document including text arranged according to some common theme or arrangement.

The term “text” (or the like) may be used to refer to alphanumeric data—e.g., alphabetic and/or numeric characters, which may include Latin letters, Arabic digits, punctuation marks, symbols, characters from other character sets, etc. As one example, text may include Unicode characters (e.g., currently 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji). Thus, the term text is to be understood broadly and is not meant to be limited to only “letters” unless otherwise indicated by the context of use.

To analyze a document, a user 106 may utilize one or more API calls to the document processing service API(s) 105, which may be carried by HyperText Transfer Protocol (HTTP) request messages destined to an endpoint associated with the provider network 100 or document processing service API(s) 105, though in other embodiments the API calls may be made via function/procedure calls to a library that implements the document processing service 112, in which case the document processing service 112 may optionally be resident on the client device 104 (not illustrated) and thus, locally invoked by client code. For example, a first type of API call—here represented as using a method of “startDocumentAnalysis” 130—may be sent as a request at circle (A) to indicate to the document processing service 112 that the user 106 desires for asynchronous analysis to be performed for one or multiple documents. In such an asynchronous mode of operation, the document processing service API(s) 105 may respond with a “unique” (at least within a particular period of time in some particular context) job identifier (ID) in a response (to the request) that can later be used to poll for the job's status and/or obtain eventual results. (This job ID can be obtained, for example, by the asynchronous job manager service 114 as described below, which in some embodiments is the only component of the asynchronous path that is on a request critical path.)

By way of example, the startDocumentAnalysis 130 request may include, for example, a “Document” attribute including a number of bytes carrying the document itself, or a set of identifiers of a location of a document (e.g., at/in a storage service). The startDocumentAnalysis 130 request may also include a set of feature types indicating which types of features the user seeks to extract from the document (e.g., all tables, all forms (or, key-value pairs), all types of features, etc.), a request token (for idempotency, which could be an ASCII string of some size—e.g., between 1 and 64 characters in length), a job tag (for tracking purposes, which also could be an ASCII string of some size—e.g., between 1 and 64 characters in length), an identifier of a notification channel (e.g., a provider network resource name) that can be used to send a job completion event, etc. Similarly, the response to a startDocumentAnalysis 130 request may include a job identifier as indicated above that can be used by the user to track the status of the job and/or obtain results of the job.
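By way of illustration only, such a request body might be shaped as in the following sketch; the field names and storage-location format here are hypothetical stand-ins chosen for this example, not a defined schema:

    # Hypothetical startDocumentAnalysis request payload (illustrative only).
    start_document_analysis_request = {
        "Document": {
            # Either raw bytes of the document...
            # "Bytes": b"...",
            # ...or identifiers of its location in a storage service:
            "StorageObject": {"Bucket": "example-bucket", "Name": "forms/application.png"},
        },
        "FeatureTypes": ["TABLES", "FORMS"],           # which features to extract
        "ClientRequestToken": "req-token-0001",        # idempotency token, 1-64 ASCII chars
        "JobTag": "hr-forms-batch",                    # tracking tag, 1-64 ASCII chars
        "NotificationChannel": "resource-name-of-channel",  # for job completion events
    }
    # A corresponding response would carry the job identifier used for later
    # polling, e.g.: {"JobId": "1a2b3c..."}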

In contrast to this asynchronous processing configuration, as shown by diamond (A), another type of API call may be issued—here represented as using a method of “analyzeDocument” 134—to the document processing service API(s) 105 (e.g., carried by an HTTP request message) indicating that the user desires for synchronous analysis to be performed for one or more documents, and thus, the results of the process to be returned in a synchronous manner (e.g., within or otherwise identified by an analyzeDocument response message 134, which may be carried by an HTTP response message sent in response to an analyzeDocument HTTP request).

By way of example, the analyzeDocument 134 call may include, for example, a “Document” attribute including a number of bytes carrying the document itself, or a set of identifiers of a location of a document (e.g., at/in a storage service). The analyzeDocument 134 call may also include a set of feature types indicating which types of features the user seeks to extract from the document, e.g., all tables, all forms (or, key-value pairs), all types of features, etc. Similarly, the response to the analyzeDocument call may identify results in a variety of formats—e.g., a list of blocks identified within the document, a unique identifier of each block, a relationship between the block and other blocks (e.g., child or parent), a block type identifying a type of the block (e.g., block, key, page, value, line, word, etc.), an identifier of which page the block was found on, what entity type is inferred for the block, document metadata (e.g., number of pages, size, format, etc.), or other information.
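A response of the kind just described might be shaped roughly as follows; the exact field names are again assumptions for illustration:

    # Hypothetical analyzeDocument response (illustrative only).
    analyze_document_response = {
        "DocumentMetadata": {"Pages": 1},
        "Blocks": [
            {"Id": "b1", "BlockType": "PAGE", "Page": 1,
             "Relationships": [{"Type": "CHILD", "Ids": ["b2", "b3"]}]},
            {"Id": "b2", "BlockType": "KEY", "Page": 1,
             "Relationships": [{"Type": "VALUE", "Ids": ["b3"]}]},
            {"Id": "b3", "BlockType": "VALUE", "Page": 1},
        ],
    }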

In some embodiments, the document processing service API(s) 105 may be responsible for handling these incoming requests from users with as low latency as possible, and may also handle general administrative type tasks on behalf of the document processing service 112, including but not limited to metering, throttling, whitelisting, access control, and/or other functionalities known to those of skill in the art.

Either type of request—e.g., the startDocumentAnalysis request 130 or analyzeDocument request 134—may explicitly carry the document(s) to be analyzed or may carry an identifier that can be used to obtain the document(s). For example, a request may include one or more identifiers of one or more storage locations (e.g., a URI, a storage bucket/folder or path name) where one document or multiple documents are stored. Thereafter, the document processing service 112 may use the identifier(s) to obtain the document(s)—as one example, by sending a request to download one or more documents based on the identifier(s) and receiving one or more responses carrying the one or more documents.

Regarding the synchronous path, at diamond (B) the document processing service API(s) 105 may send a request (and/or the document) to the analysis service(s) 110 for processing, and the results of the processing can be passed back at diamond (C) to the document processing service API(s) 105, and ultimately passed back to the client device 104 in a synchronous manner, e.g., as a response to the analyzeDocument 134 request from diamond (A).

Turning to the asynchronous path, at circle (B) the document processing service API(s) 105 can provide the request to an asynchronous job manager service 114. The asynchronous job manager service 114 may implement the general managerial logic relating to managing asynchronous jobs and their associated state. These responsibilities may include one or more of: creating an authoritative job identifier for the job (e.g., which may be passed back to the client in a startDocumentAnalysis response 130 as described above), persisting the initial job details (e.g., the job ID, what user is involved, what is the time of the request, what document(s) are to be processed, etc.) into the job table 118 at circle (C), queuing a message/entry onto the “DocumentAdded” queue 120 (e.g., each entry indicating a job ID and an identifier of the document such as a URI/URL) indicating that a new job has been created at circle (D), monitoring and maintaining the state information for a customer job (e.g., via the job table 118), providing this state information to the document processing service 112 when requested, and/or other job-related tasks. The job table 118, in some embodiments, represents the data that is needed to answer user API calls about the status of their asynchronous processing jobs.

The Document Added queue 120 may serve to buffer incoming asynchronous jobs (e.g., one document per job). This queue 120 may assist in implementing fair share scheduling that gives priority to smaller documents.

The chunk manager service 116 (e.g., a worker thread of the chunk manager service 116) at circle (E) may detect that a document has been added to the DocumentAdded queue 120 (e.g., via polling the queue, via subscribing to events or notifications issued by the queue, etc.). With each document, the chunk manager service 116 may perform a query/lookup at circle (G) with a document metadata table 124 to see if the document has already been validated (e.g., during a previous round of processing, where the analysis of the document may have been paused or halted)—if so, the chunk manager service 116 need not again download the document, validate the document, examine the number of pages of the document, etc. If not, the chunk manager service 116 may detect and divide multipage documents into one or more “chunks” via the segmenter/validator 117 at circle (F), where each chunk can fit into a single inference batch. In some embodiments, a chunk is made up of one or more document pages—for example, twenty pages per chunk. The system may process chunks in parallel in a distributed manner. To implement this process, the chunk manager service 116 may be responsible for performing one or more of the following: splitting the document into chunks (e.g., via segmenter/validator 117) and storing/uploading those chunks at/to a storage location (e.g., of a storage service of the provider network 100), inserting records in a chunk added queue 126 at circle (I) that correspond to the segmented chunks, extracting available metadata about the document file objects (such as text, tables, etc., which can be provided later to the analysis service 110 components to improve model accuracy, for example) and storing this information in a document metadata table 124 at circle (G), maintaining persistent records of the processing status of each chunk in the chunk table 122 (which stores metadata about the status and result of each chunk that is or has recently been processed) at circle (H), determining when all chunks of a document have been processed by the analysis service(s) 110 (via monitoring of chunk completed queue 128 at circle (L), discussed later herein), aggregating a collection of chunk results into a single overall result for the document, notifying the asynchronous job manager service 114 at circle (M) once processing has been completed for a document, and/or other processing tasks. The asynchronous job manager service 114 may then update its job table 118 as necessary (e.g., to indicate a completion status of the job, etc.). Notably, one or multiple instances of the analysis service(s) 110 may exist, which may identify chunks added to the chunk added queue 126 (e.g., via polling the chunk added queue 126, by subscribing to events of the chunk added queue 126, etc.) and perform analysis on a single chunk at a time, or on multiple chunks in parallel.
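As a minimal sketch of the splitting step only (the record fields and twenty-page chunk size are assumptions drawn from the example above):

    PAGES_PER_CHUNK = 20  # example chunk size from the description above

    def split_into_chunks(document_id: str, num_pages: int) -> list[dict]:
        """Divide a multipage document into chunk records, each small
        enough to fit into a single inference batch."""
        chunks = []
        for first in range(1, num_pages + 1, PAGES_PER_CHUNK):
            chunks.append({
                "document_id": document_id,
                "chunk_index": len(chunks),
                "first_page": first,
                "last_page": min(first + PAGES_PER_CHUNK - 1, num_pages),
                "status": "PENDING",
            })
        return chunks

    # Each record would then be persisted to the chunk table 122 and a
    # corresponding message placed on the chunk added queue 126.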

The chunk added queue 126 may optionally include a set of queues which represent different chunk priorities—e.g., high priority, lower priority (e.g., for larger documents), etc. The chunk added queue 126 may act as a buffer between the document segmentation process and the actual analysis that allows for chunks to be processed in parallel as well as optionally supporting different chunk priorities. In some embodiments, this queue 126 further assists in implementing a fair share scheduler that could optionally favor specific chunks inside a document.

The chunk completed queue 128, upon the completion of analysis for a chunk, is updated with an entry that can be read by the chunk manager service 116 and used to update the status of the job as a whole as well as create the final (completed) result.

After submitting the processing job via the startDocumentAnalysis 130 request (and the receipt, by the client device 104, of a job ID), the client device 104 may “poll” the document processing service API(s) 105 to determine the status of the job (e.g., complete, still processing, error, etc.), and obtain the results of the job when it is complete. These tasks may be performed by a single API call—getDocumentAnalysis 132 (e.g., including a job ID, seeking the job status and optionally the analyzed document output, if the job is finished)—or separated into distinct API calls (e.g., getJobStatus and getDocumentAnalysis), for example. Upon receipt of such a call, the document processing service API(s) 105 may pass the request onto the asynchronous job manager service 114, which can consult the job table 118 (or another results store, when output from a completed job is stored separately) to provide the necessary job status and/or job output back to the document processing service API(s) 105 and thus back to the requesting client device(s) 104. A variety of types of status values may be used according to the implementation, such as any or all of: an invalid document type was provided, the provided document was password protected, the identified file is unable to be obtained, the size of the file is larger than a threshold amount, etc.
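A caller-side polling loop against such an API might look like the following sketch, where the client object and response field names are assumed for illustration:

    import time

    def wait_for_job(client, job_id: str, interval_seconds: float = 5.0) -> dict:
        """Poll a getDocumentAnalysis-style API until the job finishes."""
        while True:
            response = client.get_document_analysis(JobId=job_id)  # hypothetical client call
            if response["JobStatus"] in ("SUCCEEDED", "FAILED"):
                return response  # carries the results or an error status
            time.sleep(interval_seconds)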

In some embodiments, an additional responsibility of the chunk manager service 116 is to ensure that a small number of very large jobs do not consume all available resources and thus starve smaller jobs where faster processing would be expected by users. For example, if there are ten chunk manager service 116 instances having some number (e.g., 10) of threads each, and a user submits several hundred or thousand large documents to be processed, it could be the case that one-hundred percent of some resource could be consumed for a long period of time and block all other jobs until either new hardware is allocated or the original jobs complete. Thus, in some embodiments the chunk manager service 116 may analyze the document, and if there is not enough idle capacity available for the regular set of analysis operations, it may return the document to the queue until a later point in time. In this case, the chunk manager service 116 may cache the size details (in the document metadata table 124) so that the next time that document is processed it doesn't need to be re-downloaded and/or analyzed to determine its size. Some embodiments may also use file-based heuristics to determine a document's “likely” size—for instance, the number of bytes of a document provides a useful indication of an estimated page count.

In some embodiments, the chunk manager service 116 will segment and “validate” the documents as introduced above (e.g., via segmenter/validator 117). For security reasons, processes that actually open a document (for validating or segmenting) may need to be performed in a “sandbox” that is partitioned from the internet to avoid inadvertently participating in an attack or other malfeasance. Thus, the segmenter/validator 117 component of the chunk manager service 116 may be a separate, deployable unit that may co-reside with or otherwise be accessible to the chunk manager service 116 (and optionally, the analysis service(s) 110). For example, in some embodiments the asynchronous flow utilizes both the segmenter and validator components when processing the full document, whereas the synchronous flow may include providing the document(s) directly to the analysis service(s) 110, where they may be validated (as being proper documents, within the analysis service(s) 110 or via use of the segmenter/validator 117) but, in many embodiments, not segmented (as the synchronous path may be constrained to single-page documents, for example).

Accordingly, in some embodiments the segmenter/validator 117 component may be a “sidecar” service, and/or may have two different APIs exposed—one that validates and returns metadata such as page count, etc., and another that performs the actual document segmentation, which can help support fair scheduling in that large files can be detected and requeued in moments of increased utilization.

Notably, these various service components can be auto-scaled (to include more or fewer instances of each service) based on resource consumption to allow for ease of scaling up or down and ensure satisfactory performance of the entire system.

Embodiments may further implement protections to defend against bugs that might trigger infinite work for the asynchronous system, e.g., by implementing maximum retry counts, dead letter queues, etc., to combat this situation. Thus, once a chunk has failed processing more than some threshold number of times, it can be added to a “dead letter queue,” which results in an alarm being generated.
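One plausible shape for that retry guard, with the queue objects, alarm helper, and threshold as assumed stand-ins:

    MAX_FAILURES = 3  # illustrative retry threshold

    def handle_chunk_failure(chunk: dict, chunk_queue, dead_letter_queue, alarms) -> None:
        """Requeue a failed chunk until it exceeds the retry limit, then
        divert it to the dead letter queue and raise an alarm."""
        chunk["failures"] = chunk.get("failures", 0) + 1
        if chunk["failures"] > MAX_FAILURES:
            dead_letter_queue.put(chunk)
            alarms.fire(f"chunk {chunk['chunk_index']} failed {chunk['failures']} times")
        else:
            chunk_queue.put(chunk)  # requeue for another attempt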

For detail regarding how the actual processing of a document is performed by the analysis service(s) 110, we turn to FIG. 2, which is a diagram illustrating an exemplary architecture of an analysis service 110 for layout-agnostic complex document processing according to some embodiments. This exemplary architecture includes a detect text engine 202, an optical character recognition (OCR) engine 206, an analyze document engine 204, an image preprocessing engine 208, a document classification engine 212, form analysis engine 214, receipt analysis engine 216, invoice analysis engine 218, generic analysis engine 220, and other supporting segmenters 222-228 and engines 230-232. As indicated above, the analysis service 110 may be implemented using software executed by one or multiple computing devices, possibly in a distributed manner, and thus these components may similarly be implemented as software, each executed by one or multiple computing devices.

At circle (1), the document image is passed to the main analyze document engine 204. At circle (2), the data of the image (carrying the representation of the document)—e.g., the bytes that make up the image—are passed to the image preprocessing engine 208, where the image is validated to ensure that a document does exist within the image (by image rejection 210A module, which can comprise an image library that can ensure the image is a proper image of a document, and/or a machine learning (ML) model trained to identify document elements such as text, lines, etc., such that a picture of a dog, for example, would be rejected as improper). The image preprocessing engine 208 may also rectify the image with image rectification 210C module (e.g., by rotating, skewing, etc., the image to align it to have 90 degree edges, etc.) and segment the image (via boundary detection 210B module) into distinct elements—for example, into two receipts scanned on a single page. The boundary detection 210B module may comprise an object detection ML model trained to identify particular elements of a document, e.g., a table, a form, a paragraph of text, etc., with bounding boxes that encapsulate these elements. The result of this call to the image preprocessing engine 208, in some embodiments, is an array of image bytes and associated bounding boxes of elements, where an element may or may not correspond to a segment.

For each such element identified by the image preprocessing engine 208, the corresponding bytes of that element of the image are passed to the detect text engine 202 at circle (3), which can identify the text within the element via use of one or more OCR engines 206 known to those of skill in the art. In some embodiments, the calls made at circle (3) may be done at least partially in parallel. The one or more OCR engines 206 can detect and recognize text within an image, such as document keys, captions, numbers, product names, etc., and may support text in most Latin scripts and numbers embedded in a large variety of layouts, fonts and styles, and overlaid on background objects at various orientations. The OCR engine(s) 206 may include a first engine that identifies bounding boxes specific to the text or portions of the text (such as words), and a second engine that uses these bounding boxes to identify the particular text within the bounding boxes. As a result, the OCR engines 206 may generate a list of “words” found in the document, each represented as a 3-tuple identifying where the word was located, what the word is, and a confidence score: [Bounding Box, Text Value, Confidence Score].
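Concretely, that word list might look like the following, with illustrative normalized (left, top, width, height) bounding boxes and the key/value text borrowed from the FIG. 4 example later in this description:

    # Illustrative OCR output: [Bounding Box, Text Value, Confidence Score].
    words = [
        ((0.10, 0.32, 0.03, 0.02), "4.", 0.99),
        ((0.14, 0.32, 0.10, 0.02), "TOTAL", 0.98),
        ((0.25, 0.32, 0.11, 0.02), "WAGES", 0.98),
        ((0.55, 0.32, 0.15, 0.02), "$52,304.82", 0.96),
    ]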

At circle (5), the segmented image resulting from the image preprocessing engine 208 may be passed to the document classification engine 212, which can identify what “segment” the element is, or what one or multiple segments are within an element. The document classification engine 212 may be a ML model (such as a convolutional neural network (CNN) or other deep network) trained to identify different types of segments, such as paragraphs, forms, receipts, invoices, etc. The document classification engine 212 may thus output identifiers of what portions of the element include what type of segment—for the sake of this example, we assume that a first segment is identified as being of type “paragraph” while a second segment is identified as being of type “form.”

The paragraph segment is passed to the generic analysis engine 220 at circle (6A), where it may further segment the segment into sub-segments—e.g., paragraph, line, etc.—at circle (7A) via use of a generic layout segmenter 228, optionally resulting in an array of Semantic Region objects [Region Label, Bounding Box, Id]. Beneficially, this call can be parallelizable as the result may not be needed until a final consolidation of the result.

Similarly, the form segment is passed to the form analysis engine 214, where it may further segment the segment into sub-segments specific to forms (e.g., form title, figure, etc.) at circle (7B) by form layout segmenter 222, again optionally resulting in an array of Semantic Region objects [Region Label, Bounding Box, Id]. Likewise, this call can also be parallelizable as the result may not be needed until a final consolidation of the result.

For each of these semantic region objects representing a table, a call can be made to a table extraction engine 230 at circle (8B) that is trained to identify components of tables—e.g., table headings, rows, etc.

Similarly, the form segment may be passed to a key value extraction engine 232 at circle (9B) which, as described in additional detail herein, can analyze the form to identify keys in the form, values in the form, and which keys correspond to which values.

Although not described in this example, for other segments similar components can be utilized. For example, for a detected invoice segment, the segment can be passed to an invoice analysis engine 218, which may make use of an invoice layout segmenter 226 that can identify semantic region objects as described above. As invoices have fairly standard layouts across industries and even across the world—where there are line items having a description of the item, a number of items, a price per item, etc.—the invoice layout segmenter 226 may comprise one or more neural networks trained to look at such an invoice and determine which portions are headings, which are line items, verify the invoice amounts (e.g., total items, total cost, etc.), etc., and generate data explaining what the invoice represents.

With the resulting information and corresponding location information (e.g., bounding boxes) of the segments, some embodiments perform additional post-processing to consolidate the information from the potentially multiple inference pipelines into a cohesive “hierarchical” result. For example, keys and values may include multiple lines or words, and thus post-processing can be implemented to identify bounding box overlaps (e.g., the existence of a bounding box within another bounding box), which can thus be represented in a hierarchical manner (e.g., as a parent-child relationship). By way of example, a document may include several top-level entities (e.g., a form, a receipt, a block of text) and each of these entities may have one or more entities (e.g., for a form, one or more key-value pairs) and perhaps so on at further levels (e.g., for a key-value pair, a key and a value; and likewise, for a value, one or multiple lines of text), etc.
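A minimal sketch of one way such parent-child relationships could be derived from bounding-box containment follows; the box format and block structure are assumptions, and real post-processing would also need to handle partial overlaps and multi-level nesting:

    def contains(outer, inner) -> bool:
        """True if box `inner` lies entirely within box `outer`.
        Boxes are (left, top, width, height) in normalized coordinates."""
        ol, ot, ow, oh = outer
        il, it, iw, ih = inner
        return ol <= il and ot <= it and il + iw <= ol + ow and it + ih <= ot + oh

    def attach_children(blocks: list[dict]) -> None:
        """Give each block the ids of blocks whose boxes fall inside its own,
        yielding a hierarchy such as form -> key-value pair -> line -> word."""
        for parent in blocks:
            parent["child_ids"] = [
                b["id"] for b in blocks
                if b is not parent and contains(parent["box"], b["box"])
            ]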

For further detail of an exemplary implementation and use of a key value extraction engine 232, FIG. 3 is a diagram illustrating an environment including a document processing service of a provider network for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments.

As indicated above, to utilize the document processing service 112, a user 106 may utilize a client device 104 to obtain (e.g., create, download, capture via a camera/optical sensor, etc.) an electronic document 310. The electronic document 310 may be a digital file storing a representation of a document such as a form. For example, the electronic document 310 may be a Portable Document Format (PDF) file, a word processing document such as a Word™ document or Open Document Format (ODF) file, an image including a representation of a document (e.g., a JPG, GIF, PNG), etc.

Optionally, the user 106 may upload 330 the electronic document 310 (e.g., a digital image) to a location that is distinct from the document processing service 112 at circle (1). For example, the user 106 may cause the electronic device 104 to send the electronic document 310 to be stored at a location of a storage service 308 within (or alternatively, outside of) the provider network 100. The storage service 308 may provide, in a non-illustrated response message, an identifier of the electronic document 310 (e.g., a Uniform Resource Locator (URL) for the file, a unique identifier of the file within the context of the storage service 308 or provider network 100, a name of a folder (or storage “bucket” or group) that includes the file, etc.).

At circle (2), the user 106 may utilize a client device 104 to cause it to send a processing request 334A to the document processing service 112. The processing request 334A may indicate the user's desire for the document processing service 112 to process a particular electronic document 310 (or group of electronic documents) in some manner.

For example, the processing request 334A may indicate that text of a form represented within the electronic document(s) 310 is to be identified, and that text “values” corresponding to text “keys” within the form are to be identified and paired, stored, and/or processed in some manner. As one example, an electronic document 310 may comprise an image file that was captured by an optical sensor of a scanner device or a user's mobile device, where the image file is a picture of a form such as a government-issued form document (e.g., a W2 form, a tax return form, etc.), a form provided by a company (e.g., a product order form, an employment application, etc.), or other type of form. Thus, the processing request 334A may indicate a request to identify data values entered within the form—e.g., a number (the value) of years of experience (the key) entered into an employment application—and store those values (e.g., within a database, which may have columns/keys corresponding to form keys), process those values (e.g., using a special purpose script, application, or set of rules), etc.

The processing request 334A may include (or “carry”) the electronic document 310 or may include an identifier of an electronic document 310 (e.g., a URL). In a use-case where the processing request 334A includes the electronic document 310, the document processing service 112 may optionally at circle (3) store the electronic document 310 via a storage service 308. In a use-case where the processing request 334A identifies the electronic document 310, the document processing service 112 may obtain the electronic document 310 at circle (3) based on the identifier (e.g., by sending a HyperText Transfer Protocol (HTTP) GET request destined to a provided URL).

To process the electronic document 310, optionally a set of pre-processing operations may be performed—e.g., verify that a page/form is present in the image, determine whether the image needs to be rotated, rectify the page, clean up noise, adjust coloring, contrast, or the like, determine if the image is of sufficient quality (in terms of resolution, occlusions, contrast), etc. This set of preprocessing operations may be performed by the image preprocessing engine 208 of FIG. 2, for example.

At circle (4), the text recognition/localization unit 314 may operate upon the electronic document 310 to identify locations of text within the document. The text recognition/localization unit 314 may comprise, for example, an object detection ML model trained to identify locations of characters, words, lines of text, paragraphs, etc., as is known to those of skill in the art. The text recognition/localization unit 314 can identify the locations of the text in the form of bounding boxes, coordinates, etc.

Optionally, the text recognition/localization unit 314 can also identify the text itself within the electronic document 310. This identification may include using the identified locations of text and may include performing an optical character recognition (OCR) process upon these locations. Thus, this OCR procedure may be run against a subset of the document (in the form of the identified locations) and not against the entire document itself, which can be faster, more resource efficient, and eliminate or reduce the analysis of other non-necessary text that may be within the document (e.g., instructions, footnotes). In some embodiments, such text identification occurs before the operations described regarding circles (5A), (5B), and/or (6), though in other embodiments it occurs in parallel with ones of these operations, or even after these operations have completed. For example, after key regions and associated value regions have been identified (e.g., after circle (6)), key-value association unit 319 may trigger (directly or indirectly) the text recognition/localization unit 314 to identify what the text is within the key regions and value regions. Alternatively, the text recognition/localization unit 314 may simply perform an OCR process that may not include a separate location detection phase.

However, in some embodiments the text recognition/localization unit 314 may utilize different techniques to achieve the same result. For example, the electronic document 310 may potentially be a PDF file already including the text within the document (instead of just carrying a “flat” image alone), and the text recognition/localization unit 314 may identify this text and also identify the locations of that text (e.g., from the PDF metadata, from its own object detection or matching model, etc.).

At this stage, text has been detected as being represented within the electronic document 310, and locations of the text have similarly been detected. However, when the electronic document 310 is a form—and especially a “new” form that has not yet been observed—there is no knowledge of which text elements are “keys” of the form and which are “values.” As is known, a form may include one or more keys such as “first name,” “last name,” “full name,” “address,” “amount,” “value,” etc., and the completed form may have values for one or more (or possibly all) of these keys—e.g., “Dominic,” “200.50,” “Samuel,” etc. To be able to act upon this data in a programmatic and intelligent way, it is imperative to determine which text elements are keys and which are values.

The detected text elements and/or location data 340 are provided to the key-value differentiation unit 316, which can operate on portions (or all) of the electronic document 310 at circle (5B) with this provided information to determine which of the text elements are keys and which of the text elements are values. The detected text elements and location data 340 may be provided by sending this data (e.g., within files, or within a “string” or “blob” of text) directly to the key-value differentiation unit 316, by storing this data in a known storage location (e.g., by storage service 308) that the key-value differentiation unit 316 is configured to look in, by sending an identifier of such a location to the key-value differentiation unit 316, etc. Similarly, the electronic document 310 itself may be passed according to similar techniques together with or separate from the detected text elements and location data 340.

The key-value differentiation unit 316 operates by generating a feature vector for each text element (e.g., word or phrase) using one or more specially-trained machine learning (ML) models. The feature vectors created for each text element of a particular electronic document 310 are clustered into (at least) two different groups. The key-value differentiation unit 316 can then use labeled feature vectors (indicating whether corresponding text elements are keys or values) to determine which cluster includes feature vectors corresponding to “key” text elements and/or “value” text elements. The labeled feature vectors may have been provided/used during the training of the ML model(s), could be “actual” keys and values previously determined by the key-value differentiation unit 316 (and confirmed as being accurately detected), or a combination of both.

For example, according to some embodiments, a key-value differentiation unit 316 can identify which words or phrases of an electronic document are key fields and which words or phrases are key values. The key-value differentiation unit generates feature vectors for detected text elements from a document using a ML model that was trained to cause feature vectors for key fields to be separated from key values, feature vectors for key fields to be close to those of other key fields, and feature vectors for values to be close to those of other values. The feature vectors are clustered into two clusters. For values of each cluster, neighbors (e.g., nearest neighbors) can be identified from a labeled set of feature vectors, and based on the labels of the neighbors from each cluster, the identity of each cluster is determined.
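A minimal sketch of that cluster-then-label step follows, using scikit-learn as an assumed stand-in for whatever clustering and nearest-neighbor machinery an embodiment would actually use:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    def label_clusters(vectors: np.ndarray,
                       labeled_vectors: np.ndarray,
                       labels: np.ndarray) -> np.ndarray:
        """vectors: feature vectors for this document's text elements.
        labeled_vectors/labels: reference vectors known to be keys (0)
        or values (1), e.g., from training or prior confirmed results."""
        cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
        knn = KNeighborsClassifier(n_neighbors=5).fit(labeled_vectors, labels)
        out = np.empty(len(vectors), dtype=int)
        for c in (0, 1):
            # Decide the cluster's identity by majority vote of its members'
            # nearest labeled neighbors.
            votes = knn.predict(vectors[cluster_ids == c])
            out[cluster_ids == c] = np.bincount(votes, minlength=2).argmax()
        return out  # per text element: 0 = key, 1 = value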

When the text elements that are keys and/or the text elements that are values (or “keys/values 142”) are determined, the keys/values 142 may be provided to the key-value association unit 319 and processed at circle (6) as described above. Additionally, at circle (5A) the embedding generator 318 may similarly operate on the electronic document 310 to generate per-pixel embeddings as described above, which are also provided to the key-value association unit 319 and processed at circle (6) as described above.

At this point, the key-value association unit 319 generates an output identifying which keys are associated with which values. This data may be stored via one or more storage services 308 as document data 320 at optional circle (7A). As one example, a representation of the keys and values may be generated (e.g., in JSON or eXtensible Markup Language (XML) format, as a string or blob of text, etc.) and stored as a file or as a database entry (or set of entries).

The output may be of a variety of formats. For example, for a value, the output may be a text string or a number, an indicator (e.g., CHECKED/UNCHECKED, TRUE/FALSE, etc.) indicating whether a checkbox (or other user interface input element) is marked, an image crop (e.g., a signature field), etc.

Additionally, or alternatively, at optional circle (7B) the data may be provided to an external (to the provider network 100) destination such as a client device 104 of the user 106, possibly within a processing response message 334B, which may or may not be responsive to the processing request 334A. For example, the processing response message 334B may be sent as part of a session of communication where the user 106 utilizes a client device 104 to interact with the document processing service 112 via a web application (and thus, via a web endpoint interface 103) or another application.

As another example, in some cases the actual text may not yet have been identified (e.g., such as when the text recognition/localization unit(s) 314 have detected regions/locations including the text, but not the actual text itself), and thus at circle (7C) the key-value association unit 319 can cause the text recognition/localization unit(s) 314 to identify the actual text of the keys and values. For example, the key-value association unit 319 may send a request to the text recognition/localization unit(s) 314 that includes (or identifies) the particular regions of interest to be analyzed for text recognition. The resulting text could be returned to the key-value association unit 319, stored as document data 320, sent back within processing response 334B, etc., depending on the embodiment.

Although these exemplary functions and units 314/316/319 are described as being utilized in a serial or sequential manner, in various embodiments these functions may be implemented in other ways known to those of skill in the art with the benefit of this disclosure. For example, in some embodiments the key-value association unit 319 may operate partially or completely in parallel with other processing units—e.g., the key-value differentiation unit 316. In other embodiments, the key-value association unit 319 and key-value differentiation unit 316 may be implemented as a combined unit, possibly benefiting from the existence of common processing tasks needed by each, which would thus need to be performed only once. Thus, many variations of these techniques and implementations exist which are covered by various embodiments.

For further detail regarding how the key-value differentiation unit 316 can operate to differentiate between text elements that are keys and text elements that are values, FIG. 4 is a diagram illustrating an example of training of one or more machine learning models of a key-value differentiation unit according to some embodiments.

In this figure, an example electronic document 310A is shown that includes multiple keys and multiple values. For example, a key 402 is shown as “4. TOTAL WAGES” while a value 404 is shown as “$52,304.82”. This electronic document 310A, and likely many other electronic documents (e.g., 310B-310Z), may be used to train one or more machine learning models 406 to generate feature vectors 418 that are “close” together for two keys or for two values, but which are “far apart” for a key and value, allowing for keys and values to be programmatically distinguished.

As indicated, a large amount of training data in the form of annotated electronic documents 310A-310Z may be obtained or generated. Each annotated electronic document 310 may include a set of annotations—e.g., one or more of: an identification of what areas of the document include keys or values (e.g., bounding boxes), an identification of what text elements are present in the document (e.g., string values), an identification of which of the text elements are keys or values, etc. In some embodiments, the machine learning model(s) 406 may be trained using synthetic documents, semi-synthetic documents (where the keys and/or layouts are real, though the values are not), or annotated real documents.

In some embodiments, the machine learning model(s) 406 are trained using an iterative process, where each iteration utilizes a pair of text elements—e.g., a training input pair 422. The training input pair 422 may include a key and another key, a key and a value, or a value and another value. The training input pair 422 may include identifiers (e.g., coordinates, bounding box information, etc.) of the corresponding locations of the document—or the actual data making up those locations—that include the text element, the text of the text element, a label of whether the text element is a key or a value, etc.

The one or more machine learning model(s) 406 may implement different branches of processing that may be executed in parallel, in series, or combinations thereof.

A first branch of the processing is performed by a text element encoder 408. The text element encoder 408 in some embodiments operates upon the pair 426 of the text elements to generate a word embedding 434 for each of the text elements (e.g., each word or phrase). Thus, the text element encoder 408 may operate on a word or phrase basis and encode the semantic meaning of the word (e.g., an embedding for “first” may be relatively close to the embedding for “name” but will be far away from the embedding for “0.20”).

The term “word embedding” is commonly used to refer to an output generated by any of a set of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors (e.g., of real numbers). Conceptually, generating a word embedding involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension. Each word embedding may be generated by the text element encoder 408 using a variety of techniques known to those of skill in the art, such as a neural network, probabilistic models, etc.

For example, in some embodiments, the text element encoder 408 implements a word2vec model. Word2vec is a group of related models that are used to produce word embeddings. These models are often relatively shallow, two-layer neural networks that may be trained to reconstruct linguistic contexts of words. Word2vec may operate by producing an embedding for a text element within a vector space, potentially including several hundred dimensions, where each unique text element may be assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space. The processing may occur on a word-by-word basis, for example, and an embedding for each word in a text element may be combined (e.g., concatenated, possibly according to a consistent technique) into a single embedding 434.
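For instance, a word2vec model can be trained with the gensim library as in the sketch below; the tiny corpus and parameter values are purely illustrative:

    from gensim.models import Word2Vec

    # Toy corpus of tokenized form phrases (illustrative only).
    corpus = [["first", "name"], ["last", "name"], ["total", "wages"]]
    model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1)

    # Per-word vectors; a multi-word text element's embedding may then be
    # formed by combining (e.g., concatenating or averaging) its word vectors.
    vec_first = model.wv["first"]
    vec_name = model.wv["name"]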

Additionally, or alternatively, the text element encoder 408 may comprise one or more units (e.g., layers of a neural network) that can learn and generate font embeddings 434 for the visual aspects of the text elements 426 (e.g., the visual representation of the text elements within the electronic document 310A, such as a portion of an image including the text element(s)). In some cases, particular fonts or font styles (e.g., bold, italics, underline, etc.) may be utilized by keys and/or values, and this information can be “learned” by the text element encoder 408 and used to generate font-based embeddings. In some embodiments, the embeddings 434 include both font-based embeddings and also word embeddings, though in other embodiments the embeddings 434 include word embeddings or font-based embeddings but not both.

A second branch of the processing is performed by a structure encoder 410 using the electronic document 310A itself (together with identifiers/bounding boxes indicating where the text elements are) or portions 428 of the document including the text elements. The structure encoder 410 may comprise a convolutional neural network (CNN) having an encoder and a decoder, which may learn a feature vector 430 for each pixel of each text element. Each of the per-pixel feature vectors for a text element may then be “grouped” by a grouping unit 414, which may create a single grouped feature vector 432 for a text element. The grouping may be a concatenation, an averaging (or other statistical function), etc., based on the individual per-pixel feature vectors 430, and thus is associated with an entire text element.

A third branch of the processing is performed by a location/size unit 412, which may receive location data 424 (e.g., coordinates, bounding box information, etc.), and/or may self-detect such location information on a per-element (or per-word, or per-phrase) basis. The location/size unit 412 may generate, using this provided and/or self-detected information, location information 436 indicating this data in a standard format. As one example, this information 436 can be beneficial as it may be the case that keys and values are in different font sizes, and thus the size of the representations of the text elements may be important to assist in differentiating between the height of the words.

A feature fusion unit 416 may operate to concatenate or consolidate all of these outputs from the three processing branches. This “combining” may or may not be simple concatenation. In some embodiments, the inputs are “compressed” (e.g., by additional layers in a neural network that make up the feature fusion unit 416) from a longer mapping into a shorter one by learning what is important for differentiation and separation (of keys and values) purposes. For example, in some usage scenarios it may become evident during training that features from the top branch (shown with a bold circle) are more important than features from the second branch (shown as a dashed lined circle) or features from the third branch (shown as a solid lined circle), so the feature fusion unit 416 may have the ability to learn that and encode the resulting feature vectors 418A-418B with different amounts of data taken from the three branches of inputs. This may occur, for example, by concatenating the three branches of inputs (e.g., the embeddings 434, grouped feature vector 432, and location information 436) and projecting this concatenated value into the resultant feature vector 418 (e.g., having a smaller dimensionality than the concatenated value). In this illustrated example, “three units” from the first branch are taken, “three units” from the second branch are taken, and only “two units” from the third branch are taken; all of which are somehow combined (e.g., consolidated or otherwise transformed) to result in the feature vectors 418A-418B for the pair of text elements.
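
The following sketch illustrates the concatenate-and-project fusion described above, using a plain matrix multiply in place of learned network layers; all dimensions are illustrative assumptions:

```python
import numpy as np

# Outputs of the three branches; all sizes here are illustrative assumptions.
rng = np.random.default_rng(1)
word_embedding = rng.normal(size=8)      # embeddings (first branch)
grouped_features = rng.normal(size=16)   # grouped feature vector (second branch)
location_info = rng.normal(size=4)       # location information (third branch)

concatenated = np.concatenate([word_embedding, grouped_features, location_info])
projection = rng.normal(size=(10, concatenated.size))  # learned during training

feature_vector = projection @ concatenated  # e.g., one resultant feature vector
print(feature_vector.shape)  # (10,) -- smaller than the 28-dim concatenation
```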

Notably, although three processing branches are described in this example, in other embodiments more or fewer branches can be utilized. For example, in some embodiments only the “second” branch may be utilized, or in other embodiments only the first and second branches may be utilized, as a few examples.

Based on these feature vectors 418A-418B, the machine learning model(s) 406 can self-modify (e.g., using a grouping loss function 420) so that two feature vectors are “close” in the space if both correspond to keys or if both correspond to values, but far apart if the vectors correspond to a key and a value.

For example, FIG. 5 is a diagram illustrating an exemplary grouping loss function 500 useful for training one or more machine learning models of a key-value differentiation unit according to some embodiments. In this example, x is the input image (e.g., electronic document 310A) and P_i is the i-th pixel location in x. For the sake of simplicity, it is assumed that P_i ∈ [0,1]² (i.e., the image dimension is the unit square). Additionally, θ is the weights of the network and f_θ(x_{P_i}) ∈ ℝ^d is the output of the network (e.g., ML model) for the corresponding i-th pixel location in x, where d is the output dimension. Thus, the network is trained to minimize the loss function 500 shown in FIG. 5.

In the loss function 500, the expectation is taken over a uniform sampling strategy of pairs of pixels. The loss may be composed of a summation of two complementary terms. The first, for pixels belonging to the same label (y_i = y_j), drives the output embeddings f_θ(x_{P_i}) and f_θ(x_{P_j}) closer together. The second, for pixels with different labels (y_i ≠ y_j), encourages a separation of the embedding vectors.
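
A minimal sketch of a two-term pairwise loss with this pull-together/push-apart structure; the squared-distance and hinge-margin form is an assumption for illustration and is not necessarily the exact formulation of FIG. 5:

```python
import numpy as np

# Pairwise grouping loss: same-label pairs are pulled together, different-label
# pairs are pushed at least `margin` apart. Margin and squared form are assumed.
def grouping_loss(emb_i, emb_j, same_label: bool, margin: float = 1.0) -> float:
    dist = np.linalg.norm(emb_i - emb_j)
    if same_label:            # y_i == y_j: drive embeddings closer together
        return dist ** 2
    # y_i != y_j: encourage at least `margin` separation between embeddings
    return max(0.0, margin - dist) ** 2

a, b = np.array([0.1, 0.2]), np.array([0.15, 0.25])
print(grouping_loss(a, b, same_label=True))   # small: already close
print(grouping_loss(a, b, same_label=False))  # large: should be pushed apart
```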

Thus, by training the ML models accordingly, the resultant feature vectors for keys and values will be relatively “far” apart, while the resultant feature vectors for keys will be relatively “near” each other and similarly the resultant feature vectors for values will also be relatively “near” each other.

For further understanding regarding the use of the key-value differentiation unit 316, FIG. 6 is a diagram illustrating exemplary stages for inference performed by a key-value differentiation unit according to some embodiments. In this example, another electronic document 310B (e.g., an image including a representation of a form document) is shown in which several text elements are detected. One such detected text element 604 is shown—“CUSTOMER NAME.” At this point, the system is unaware of whether this text element and all other detected text elements are keys or values.

Thus, for each detected text element, an input package 602A-602N is provided to the key-value differentiation unit 316 for processing, which may include one or more of the entire electronic document 310B (or a portion thereof, such as a portion of the document—e.g., a 10 pixel by 90 pixel image—that includes the corresponding text element), the text element that was detected (e.g., in string form), any location data of the text element (e.g., coordinates, sizing information, a bounding box), etc.

The machine learning model(s) 406 operate upon this data for each text element to generate feature vectors 418A-418N. The key-value differentiation unit 316 may then cluster the feature vectors 418A-418N, e.g., using a clustering technique or algorithm known to those of skill in the art, such as (but not limited to) k-means, k-medoids, Gaussian mixture models, DBSCAN, OPTICS, etc. In some embodiments, the key-value differentiation unit 316 generates two clusters 608 and 610—one to include the keys, and one to include the values. However, other embodiments may use more clusters, such as three or four (or more) clusters in an attempt to detect “faraway” feature vectors needing further analysis (e.g., that would be located in their own cluster, or with few other vectors) or to detect different classes of the text elements (e.g., a cluster for paragraphs, section titles, explanatory text, etc.). In the latter case, the key-value differentiation unit 316 may again be trained with pairs of text elements, but where the individual text elements may have more than two possible classes (i.e., instead of just being a key or a value, instead being a key or a value or some other class(es)). These feature vectors 618A-618N are shown in FIG. 6 as being represented as dots in a three-dimensional space for ease of visual understanding; however, it is to be understood that in many embodiments the vectors lie in a space of higher dimensionality and thus would lie in an n-dimensional space 606.
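
A short sketch of the clustering step using k-means (one of the algorithms named above) as implemented in scikit-learn, applied to synthetic feature vectors; the dimensionality and data are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for the per-text-element feature vectors: one group of
# vectors for keys, one for values, well separated for illustration.
rng = np.random.default_rng(2)
keys = rng.normal(loc=0.0, size=(20, 10))
values = rng.normal(loc=5.0, size=(20, 10))
feature_vectors = np.vstack([keys, values])

# Two clusters: candidate "keys" and candidate "values" (labels come later).
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feature_vectors)
print(clusters[:5], clusters[-5:])  # the two groups fall into distinct clusters
```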

With two (or more) clusters identified, we turn to FIG. 7, which is a diagram illustrating additional exemplary stages for inference performed by a key-value differentiation unit according to some embodiments.

In some embodiments, for one or more feature vectors of one of the clusters 608/610, a neighbor feature vector from a set of labeled feature vectors is found (e.g., using nearest neighbor techniques known to those of skill in the art, or using a bipartite matching technique such as a minimum weighted bipartite matching algorithm to identify an overall smallest distance between feature vectors of one or multiple clusters). The set of labeled feature vectors may include one or more of those feature vectors created during the training of the machine learning model(s) 406, previously-generated feature vectors, etc. The set of labeled feature vectors are labeled in that each vector is associated with a label of “key” or “value” (or other class(es), as indicated above).

For example, assume all labeled feature vectors are represented as “X” marks in the n-dimensional space 606. For two vectors in cluster B 610, corresponding nearest neighbors 702 are shown via arrows. It can then be determined that these corresponding nearest neighbors 702 have a same label of “key”. Thus, in some embodiments, upon some number of vectors in a cluster being associated with a nearest neighbor having a same label, it can be determined that the cluster itself has the same label. In this case, cluster B 610 may thus take the label of “key.”

This nearest neighbor determination may be performed only for one or multiple (or all) of the vectors in one cluster, and a most-frequently occurring label (from all nearest neighbors in the labeled set) may be chosen as the cluster label. In some embodiments, the nearest neighbor analysis need not be performed for the other cluster A 608, as it must take on the “other” label (e.g., “keys” when the other cluster is “values,” “values” when the other cluster is “keys”). However, in some embodiments, this analysis is performed for both clusters 608/610 to ensure that the two clusters arrive at different labels. (If both clusters were to arrive at a same label, for example, the process may halt, and an error/alert may be generated.)
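
A sketch of labeling a cluster by majority vote over nearest labeled neighbors, as described above; the synthetic labeled set and plain Euclidean nearest-neighbor search are illustrative assumptions:

```python
import numpy as np
from collections import Counter

# For each vector in the cluster, find its nearest labeled neighbor, then take
# the most frequent neighbor label as the cluster's label.
def label_cluster(cluster_vectors, labeled_vectors, labels):
    votes = []
    for v in cluster_vectors:
        distances = np.linalg.norm(labeled_vectors - v, axis=1)
        votes.append(labels[int(np.argmin(distances))])  # nearest neighbor's label
    return Counter(votes).most_common(1)[0][0]  # majority vote wins

labeled = np.array([[0.0, 0.0], [5.0, 5.0]])  # e.g., saved from training
labels = ["key", "value"]
cluster_b = np.array([[0.2, 0.1], [0.1, 0.3]])
print(label_cluster(cluster_b, labeled, labels))  # "key"
```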

For example, as shown in FIG. 7, at 708 for cluster B there were 14 nearest neighbors of “key” and 0 nearest neighbors of “value”; likewise, for cluster A there were 13 nearest neighbors of “value” and 1 nearest neighbor of “key.” Thus, as shown at 710, cluster B can be determined to be keys, and cluster A can be determined to be values. Thereafter, these labels can be imparted back to the corresponding text elements themselves.

With keys being identified and values being identified by the key-value differentiation unit 316, ones of the keys can be associated with ones of the values by the key-value association unit 319. FIG. 8 is a diagram illustrating an environment for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments.

The key-value association unit 319 described herein can leverage an insight that there does exist an accepted “universal style” for forms. For example, if a person is given a new form to complete, and the form is written in a language that the person cannot read—for example, Arabic or Hebrew or Khmer—most people will still be able to easily understand where the places in the form are where a person is expected to provide an answer—i.e., where the fillable areas are in which a “value” is to be added. Similarly, for each filled “value,” individuals will easily be able to localize the question it answers—in other words, identify what the “key” is that is associated with a particular value. This phenomenon exists despite the fact that a person may be unable to read a form because there are universal styles/conventions related to forms—for example, nearly all areas around the globe make use of user interface elements such as checkboxes, a question with an empty line providing a space for the associated answer, a row of adjacent boxes in which a person is to fill in alphanumeric digits (e.g., a phone number, ZIP code, first name, etc.). As this set of conventions is limited, embodiments can “learn” to identify these basic units of visual styles for forms, and then use this learned knowledge to understand new forms (in which the layout is not known in advance). For example, embodiments can segment the different key-value units of the form, distinguish between key regions and value regions, and associate between values and their corresponding keys. A “key-value unit” is a region (for example, a rectangle or other polygon) of an image (of a form or form portion) that includes both the key (e.g., the question a portion of the form is asking, such as “Social Security Number:”) and the area where the answer (or “value”) is to be filled. Embodiments can create a training set of forms where such key-value units have been annotated, and use a machine learning model (e.g., neural network(s)) to learn to predict whether a pair of two pixels are in a same key-value unit or different key-value units. The ML model generates a per-pixel embedding, and the ultimate decision can be based on the distance between the embedding vectors of the two pixels from the pair. Once the ML model is trained, embodiments use it to match between corresponding keys and values, which may be obtained beforehand (e.g., by a key-value differentiation unit) or afterward (e.g., using an OCR engine to analyze the key and value boxes), thus enabling the system to “read” the form automatically via knowledge of matching keys and values. Moreover, some embodiments using this approach can bypass the need to “read” all the text on the form, which oftentimes contains a lot of irrelevant content such as filling instructions, form version indicators, legal notices, and so on.

In this example environment, the key-value association unit 319 may be implemented as a component of a document processing service 112, which as indicated above may be implemented using software executed by one or multiple computing devices, and in various embodiments could be implemented in hardware or a combination of both hardware and software. For example, in some embodiments, such as when the key-value association unit 319 is implemented as a component or part of a service (e.g., document processing service 112) within a provider network 100 and needs to operate in large scale to process substantial numbers of requests, the key-value association unit 319 (and/or document processing service 112) may be implemented in a distributed manner—e.g., using software executed by multiple computing devices in a same or different geographic location(s).

As indicated above, the document processing service 112 in some embodiments is implemented via software, and includes several sub-components that may also be implemented in whole or in part using software, such as a key-value differentiation unit 316, the key-value association unit 319, an embedding generator 318, etc. The document processing service 112 may perform various functions upon data (e.g., images) carrying or providing representations of documents. For example, the document processing service 112 may identify text represented within an image and process the text in intelligent ways, e.g., based on where the text is located within the representation.

As shown by optional circle (1), an electronic document 310 may be provided in some manner. For example, a client device 104 of a user 106 may upload the electronic document 310 to a document processing service 112, to a storage service of a provider network 100, or to another location.

In this example, the electronic document 310 includes a representation of a form with multiple key-value units. One example key-value unit 808 is shown as a rectangular region including a key of “2. EMPLOYER'S CITY” with a corresponding value of “BOSTON.” In some scenarios a key-value unit may comprise a rectangle or square area from the electronic document 310, though in other embodiments the key-value unit may be of a different shape, e.g., some other type of polygon that delineates a region including a key and a value. Thus, the electronic document 310 includes one or multiple key-value units 808, one or multiple corresponding keys 814, and one or more values 820—notably, it is not always the case that a key 814 will have a corresponding value 820.

In some embodiments, the electronic document 310 (or a derivation therefrom, such as an image file corresponding to a visual representation of another format of electronic document 310) is provided at circle (2A) to an embedding generator 318 that generates a set of per-pixel embeddings 322. The embedding generator 318 in some embodiments comprises a machine learning (ML) model, such as a neural network, that was trained utilizing a loss function that separates the embeddings it generates for pixels in different key-value units (that is, places them far apart according to a distance measure in an n-dimensional space) and places the embeddings it generates for pixels in the same key-value unit close together.

As one example, the ML model may comprise a neural network that uses three parallel lines of two convolutional blocks each. The three parallel lines employ three sizes of convolution kernels (e.g., small, medium, and large). All six feature sets may be collected (e.g., three resolutions and two convolutional blocks) and reduced to the embedding dimension by 1×1 convolutions. In this example, the loss function used for training this network may sample pairs of pixels and try to correctly predict whether they belong to the same key-value unit or not. For example, the distance between the embedding vectors may be converted to a probability that the two pixels belong to the same segment, and the quality of the embedding may be measured by the cross-entropy loss between the actual predictions for pixel pairs and the ground-truth 1/0 probabilities for these pixel pairs. The pairs may be sampled both from the same segments, and from different ones.
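
A minimal sketch of this training signal for a single pixel pair; the exp(-distance) mapping from distance to same-segment probability is an assumed conversion, used here only to make the cross-entropy computation concrete:

```python
import numpy as np

# Convert the distance between two pixel embeddings into a same-unit
# probability, then score it with cross-entropy against the 1/0 ground truth.
def pair_loss(emb_p, emb_q, same_unit: bool) -> float:
    dist = np.linalg.norm(emb_p - emb_q)
    p_same = np.exp(-dist)  # close embeddings -> probability near 1 (assumed map)
    p_same = np.clip(p_same, 1e-7, 1 - 1e-7)
    target = 1.0 if same_unit else 0.0
    return float(-(target * np.log(p_same) + (1 - target) * np.log(1 - p_same)))

p, q = np.array([0.3, 0.3]), np.array([0.31, 0.29])
print(pair_loss(p, q, same_unit=True))   # low loss: embeddings agree with label
print(pair_loss(p, q, same_unit=False))  # high loss: should have been far apart
```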

As another example, the embedding generator 318 may comprise a network based on a U-net backbone that was trained with a loss function that tries to have the embeddings of pixels in different key-value units far apart, and those of pixels in the same key-value unit close together. Notably, this loss function formulation is quite similar to the loss function of the first example, though the two loss formulations are not identical.

In some embodiments, the embedding generator 318 may utilize convolutions via multiple filters at several different scales (e.g., 3×3 pixels, 5×5 pixels, etc.) and, e.g., implement linear combinations of these to generate the embeddings 322. For example, the embedding generator 318 might take a window (of size 3×3 pixels), run the window across the image, and do a convolution with this image portion, resulting in an embedding.

In some embodiments, the embedding generator 318 analyzes convolutions at multiple filters at several scales, concatenates them, and performs a linear combination of the results, resulting in the embeddings. The multiple scales (e.g., 2×2, 3×3, 5×5) may be used to allow for a large enough scale (or bigger box) to be used that is able to “see” the area surrounding the immediate pixel—as a “correct” size may not be determinable, embodiments utilize multiple scales by weighting them with weights that are being learned, allowing the embedding generator 318 to adapt to the actual scales seen in the training run. In some embodiments, each per-pixel embedding may be a vector of multiple values, e.g., a 15-number vector.
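
A sketch in PyTorch of multi-scale convolutions whose outputs are concatenated and mixed by a learned 1×1 convolution into a per-pixel embedding; the specific kernel sizes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleEmbedder(nn.Module):
    """Convolve at several scales, concatenate, and learn a linear mix."""
    def __init__(self, in_ch=3, per_scale=8, embed_dim=15):
        super().__init__()
        # One convolution per assumed scale; padding preserves spatial size.
        self.scales = nn.ModuleList([
            nn.Conv2d(in_ch, per_scale, k, padding=k // 2) for k in (3, 5, 7)
        ])
        # Learned linear combination of all scales into the embedding dimension.
        self.mix = nn.Conv2d(per_scale * 3, embed_dim, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([conv(x) for conv in self.scales], dim=1)
        return self.mix(feats)  # one embedding vector per pixel

image = torch.rand(1, 3, 64, 64)
print(MultiScaleEmbedder()(image).shape)  # torch.Size([1, 15, 64, 64])
```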

The electronic document 310 may also be operated upon by a key-value differentiation unit 316 (e.g., partially or completely in parallel with the embedding generator 318 or at a different time) to identify locations 324 of keys and values from the electronic document 310, and optionally the actual keys and values themselves.

The per-pixel embeddings 322 and locations 324 of keys and values (or portions of the keys and values), optionally along with the original electronic document 310 or image version 830 thereof, are provided to the key-value association unit 319 (e.g., via a queue, via a shared storage location, within a request message or function call, etc.).

At circle (3), a graph constructor 832 may use this information to construct a graph 826. The graph 826 may be a weighted bipartite graph with a first set of nodes corresponding to the keys, and a second set of nodes corresponding to the values. Each node from the first set of nodes (representing the keys) may include an edge to each of the second set of nodes (representing the values). Each such edge has a corresponding weight that is determined by the graph constructor 832 based on the per-pixel embeddings 322.

To determine a weight between a “key” node and a “value” node, a pixel from the region of the key (e.g., some sort of center or edge pixel) is identified and a pixel from the region of the value is identified. Conceptually, a line can be stretched between these pixels, and the line can be “traced” to identify an “edge” in the embeddings where there is a large “jump” in the distance between two “neighboring” (e.g., one pixel away, two pixels away, three pixels away, etc.) embeddings—in this case, when a large jump is found, it is likely that the key and the value are not of the same key-value unit. For example, the line can be traced at an interval of some number of pixels (e.g., 3 pixels), and the distance between each pairing of two embeddings can be determined using one of a variety of types of distance measure formulations known to those of skill in the art. The overall edge weight may be based on these distances—e.g., a maximum distance value found may be used as the edge's weight.
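
A sketch of this edge-weight computation, assuming a precomputed per-pixel embedding map; the three-pixel step follows the example above, while the synthetic data and the guard for very short lines are illustrative:

```python
import numpy as np

def edge_weight(embeddings, key_px, value_px, step=3):
    """Trace the line between two pixels and return the max embedding jump."""
    (r0, c0), (r1, c1) = key_px, value_px
    n = max(2, max(abs(r1 - r0), abs(c1 - c0)) // step + 1)  # sample count
    rows = np.linspace(r0, r1, n).astype(int)
    cols = np.linspace(c0, c1, n).astype(int)
    samples = embeddings[rows, cols]  # embeddings along the traced line
    jumps = np.linalg.norm(np.diff(samples, axis=0), axis=1)
    return float(jumps.max())  # a large jump suggests a key-value unit border

emb_map = np.random.rand(100, 100, 15)  # synthetic H x W x d embedding map
print(edge_weight(emb_map, (10, 10), (10, 70)))
```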

Alternatively, in some embodiments (perhaps when the embedding generator 318 employs the network described above based on a U-net backbone or similar) the edge weight may simply be set as the distance between an average embedding from each of the two regions (i.e., the region or polygon/bounding box encompassing the key, and the region or polygon/bounding box encompassing the value).

With the constructed bipartite graph 826 with edge weights, at circle (4) a pairing unit 834 may “pair” up ones of the keys with ones of the values based on a graph analysis solving an association problem. For example, in some embodiments the pairing unit 834 may partition the graph to identify pairs of the keys with the values that result in an overall minimum edge cost.

Alternatively, a greedy approach could be employed, e.g., to pair the two nodes with an overall lowest edge weight, remove those nodes and edges from the graph (or from further consideration), and then pair a next two nodes with a next minimal edge weight, and so on. However, although this approach is sufficient in many scenarios, the overall results in some cases may be better using the non-greedy overall graph analysis described above, which optimizes for a minimum overall distance for the graph.
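
For the non-greedy variant, one way to solve the minimum-cost association is a standard linear assignment algorithm, e.g., as implemented by SciPy; the weight matrix and the 0.5 cutoff (for leaving a key unmatched, per the thresholding discussion below) are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Edge weights from each key node (rows) to each value node (columns).
weights = np.array([
    [0.97, 0.16, 0.91],
    [0.12, 0.95, 0.89],
    [0.93, 0.88, 0.98],  # key 2's best edge is still large -> likely no value
])
key_idx, value_idx = linear_sum_assignment(weights)  # minimum total edge cost
# Drop assignments whose weight exceeds an assumed threshold (blank value field).
pairs = [(k, v) for k, v in zip(key_idx, value_idx) if weights[k, v] < 0.5]
print(pairs)  # [(0, 1), (1, 0)] -- key 2 is left without an associated value
```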

At this point, the key-value association unit 319 has a set of all keys with their associated values, and this information may be used in a variety of ways—e.g., represented in JavaScript Object Notation (JSON) format and sent back to a client device 104, stored at some storage location, etc.

For further detail, FIG. 9 is a diagram illustrating an exemplary loss function 900 useful for training one or more machine learning models of the embedding generator 318 according to some embodiments. In some embodiments, the distance between the embedding vectors is converted to a probability that the two pixels belong to the same segment, and the quality of the embedding is measured by the cross-entropy loss between the actual predictions for pixel pairs and the ground-truth 1/0 probabilities for these pixel pairs. The pairs are sampled both from the same segments, and from different ones.

Thus, in some embodiments, the network is trained by minimizing the loss 900 shown, where S is the set of pixels that we choose, y_p is the instance label of pixel p, and w_pq is the weight of the loss on the similarity between p and q. The weights w_pq are set to values inversely proportional to the size of the instances p and q belong to, so the loss will not become biased towards the larger examples. The weights may be normalized so that Σ_{p,q} w_pq = 1.
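
Putting the descriptions above together, one plausible reconstruction of the loss (not necessarily the exact formulation of FIG. 9) is the following, where σ denotes an assumed distance-to-probability conversion and CE the cross-entropy:

```latex
\mathcal{L}(\theta)
  = \sum_{(p,q) \in S} w_{pq}\,
    \mathrm{CE}\!\Big(
      \sigma\big(-\lVert f_\theta(x_p) - f_\theta(x_q) \rVert\big),\;
      \mathbb{1}[y_p = y_q]
    \Big),
  \qquad \sum_{p,q} w_{pq} = 1 .
```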

For further clarity of understanding, FIG. 10 is a diagram illustrating exemplary edge weight determination, graph construction, and graph partitioning operations used for automated form understanding via layout-agnostic identification of keys and corresponding values according to some embodiments. As shown at 1000, a center pixel of one “key” node is represented as a black diamond and center pixels of the “value” nodes are represented as “X” marks. In this example, two dotted lines are illustrated connecting a center point of the key with two different values. Distances calculated based on three-pixel windows from the horizontal line are shown on the graph on the bottom. In this example, distances are real values between 0.0 (indicating a high likelihood of the pixels being in a same key-value unit) and 1.0 (indicating a low likelihood of the pixels being in a same key-value unit). As shown in the graph on the bottom, a largest (or maximum) distance is identified as 0.16.

Similarly, the vertical dotted line is also walked in three-pixel windows and the distances between these embeddings are shown in the graph on the right side. In this case, a large “edge” is detected in approximately the middle of the line, which likely corresponds to a border between different key-value units. Thus, in this case a maximum distance is identified as 0.97. This process can repeat between the key pixel and all the other “center” value pixels, and the maximum distances can be used to create weights for edges in a graph that is constructed as shown at 1002. In this example, the bipartite graph includes a column on the left for the keys, each of which is connected to each of the nodes in the column on the right representing values. Each edge comprises a weight value, e.g., the maximum distance found on the line between the corresponding center points at 1000. In this example, only a few weights for one of the key nodes are shown for simplicity of illustration and understanding—here, 0.97, 0.16, 0.91, and 0.98—though there may be more or fewer edges for each key node, and there may be more or fewer key or value nodes in different scenarios depending on the number of keys and values that were identified in the document.

At 1004, the graph can be partitioned such that every key node is connected to zero or one value nodes. This may include the use of a non-greedy optimization algorithm to find an overall partition resulting in a minimum total edge weight, a greedy algorithm that selects the smallest edge weights first and continues on, etc.

In some embodiments, an edge will only be selected and used to associate a key and a value if the edge weight is less than a particular threshold distance/weight, which may allow for no value to be associated with a key, which often happens if a value field in the form is blank.

Accordingly, the keys are associated with their corresponding values, which may be used to generate a result for the document processing described herein.

FIG. 11 is a flow diagram illustrating exemplary operations 1100 of a method for layout-agnostic complex document processing according to some embodiments. Some or all of the operations 1100 (or other processes described herein, or variations, and/or combinations thereof) are performed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. The code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising instructions executable by one or more processors. The computer-readable storage medium is non-transitory. In some embodiments, one or more (or all) of the operations 1100 are performed by the document processing service 112 of the other figures.

The operations 1100 include, at block 1105, receiving a request to analyze a document encoded by a file. The request may include data of the file or identify a storage location of the file. The file may be an image (e.g., from a digital camera or optical sensor) taken of the document, or a document file type such as a PDF. In some embodiments, the request is received at a web service endpoint of a provider network and may be carried by an HTTP request message.

According to some embodiments, the operations 1100 may optionally include determining that a value (e.g., a size of the file, a number of pages of the file, a number of pending requests for a same user, etc.) meets or exceeds a threshold; determining that an amount of idle compute capacity does not meet or exceed an availability threshold; and (e.g., in response to these two determinations) suspending processing of the file for a period of time. Such thresholds may be configured to prevent the processing of a particularly large document (e.g., a 1 GB document, a million-page document) when there is not currently (or, projected not to be) sufficient processing capacity to both process the large document and process other requests from other users. Additionally (or alternatively), some embodiments set a threshold based on a number of requests being processed by the system for a particular user (e.g., 5 concurrent requests), to prevent any one user from submitting so many processing requests at a single point in time (thus, somewhat monopolizing the service) that the performance of the service would be impacted for other users. Using such thresholds, the performance of the system can be ensured for its users under these problematic use case scenarios.

The operations 1100 include, at block 1110, identifying, within at least a portion of the electronic document, one or more segments. The identifying may include using one or more ML models to identify distinct portions of the document that may be of different segment types. The portion may comprise a “chunk” of the document, where a chunk may be a number of pages.

The operations 1100 include, at block 1115, determining, via use of one or more machine learning (ML) models, a classification of each of the one or more segments into one or more corresponding segment types. The one or more ML models may have been trained using labeled training images depicting receipts, forms, etc., so that the ML model(s) can classify these types of segments. At least one of the one or more corresponding segment types may be one of: form, receipt, invoice, paragraph, or table.

At block 1120, the operations 1100 include processing each of the one or more segments using an analysis engine selected based on the determined segment type for the segment to yield one or more results. The processing may include use of specific ML models for the particular segment types. For example, the processing for a form segment type may include detecting keys and values using a first one or more ML models, and then detecting pairings of ones of the keys with ones of the values using a second one or more ML models.

The operations 1100 include, at block 1125, transmitting a response including an output based on the one or more results. In some embodiments, the response is destined to a client device located outside of the provider network.

In some embodiments, the operations 1100 optionally include generating a job identifier (ID) of a job; transmitting the job ID within a second response, wherein the second response is associated with the request; and receiving a second request for the output of the job, the second request comprising the job ID, wherein the response is associated with the second request.

According to some embodiments, the operations 1100 may optionally include splitting the file into a plurality of chunks; analyzing, at least partially in a parallel manner, the plurality of chunks to yield a plurality of chunk results, wherein the portion is from a first chunk of the plurality of chunks; determining that the plurality of chunks have all been analyzed; and combining the plurality of chunk results into the output value.
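
A sketch of this chunked, parallel flow; the page-based chunking, the thread pool, and the placeholder analysis step are all illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(pages, chunk_size=20):
    """Split a document's pages into fixed-size chunks (size is assumed)."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

def analyze_chunk(chunk):
    return {"pages": len(chunk)}  # stands in for segment detection + analysis

pages = list(range(100))  # stands in for a 100-page document
chunks = split_into_chunks(pages)
with ThreadPoolExecutor() as pool:
    chunk_results = list(pool.map(analyze_chunk, chunks))  # parallel analysis
# Combine chunk results into the overall output once all chunks are analyzed.
output = {"chunks_analyzed": len(chunk_results),
          "total_pages": sum(r["pages"] for r in chunk_results)}
print(output)
```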

According to some embodiments, the response comprises a HyperText Transfer Protocol (HTTP) response message sent in response to the request, the request comprising an HTTP request message; the HTTP request message was received at a first Application Programming Interface (API) endpoint; and the operations further comprise: sending another HTTP request message to another endpoint associated with an analysis service, the another HTTP request message including data of the file or identifying a location of the file, and receiving another HTTP response message from the another endpoint. In some embodiments, at least one of the one or more corresponding segment types is form; and the processing of at least one of the one or more segments using the analysis engine comprises: identifying a plurality of keys of the form and a plurality of values of the form; and determining a plurality of key-value pairings, wherein at least one of the plurality of key-value pairings comprises one of the plurality of keys and one of the plurality of values that corresponds to the one key. In some embodiments, identifying the plurality of keys and the plurality of values is based on use of a second machine learning (ML) model; and determining the plurality of key-value pairings is based on use of a third ML model.

FIG. 12 illustrates an example provider network (or “service provider system”) environment according to some embodiments. A provider network 1200 may provide resource virtualization to customers via one or more virtualization services 1210 that allow customers to purchase, rent, or otherwise obtain instances 1212 of virtualized resources, including but not limited to computation and storage resources, implemented on devices within the provider network or networks in one or more data centers. Local Internet Protocol (IP) addresses 1216 may be associated with the resource instances 1212; the local IP addresses are the internal network addresses of the resource instances 1212 on the provider network 1200. In some embodiments, the provider network 1200 may also provide public IP addresses 1214 and/or public IP address ranges (e.g., Internet Protocol version 4 (IPv4) or Internet Protocol version 6 (IPv6) addresses) that customers may obtain from the provider network 1200.

Conventionally, the provider network 1200, via the virtualization services 1210, may allow a customer of the service provider (e.g., a customer that operates one or more client networks 1250A-1250C including one or more customer device(s) 1252) to dynamically associate at least some public IP addresses 1214 assigned or allocated to the customer with particular resource instances 1212 assigned to the customer. The provider network 1200 may also allow the customer to remap a public IP address 1214, previously mapped to one virtualized computing resource instance 1212 allocated to the customer, to another virtualized computing resource instance 1212 that is also allocated to the customer. Using the virtualized computing resource instances 1212 and public IP addresses 1214 provided by the service provider, a customer of the service provider such as the operator of customer network(s) 1250A-1250C may, for example, implement customer-specific applications and present the customer's applications on an intermediate network 1240, such as the Internet. Other network entities 1220 on the intermediate network 1240 may then generate traffic to a destination public IP address 1214 published by the customer network(s) 1250A-1250C; the traffic is routed to the service provider data center, and at the data center is routed, via a network substrate, to the local IP address 1216 of the virtualized computing resource instance 1212 currently mapped to the destination public IP address 1214. Similarly, response traffic from the virtualized computing resource instance 1212 may be routed via the network substrate back onto the intermediate network 1240 to the source entity 1220.

Local IP addresses, as used herein, refer to the internal or “private” network addresses, for example, of resource instances in a provider network. Local IP addresses can be within address blocks reserved by Internet Engineering Task Force (IETF) Request for Comments (RFC) 1918 and/or of an address format specified by IETF RFC 4193 and may be mutable within the provider network. Network traffic originating outside the provider network is not directly routed to local IP addresses; instead, the traffic uses public IP addresses that are mapped to the local IP addresses of the resource instances. The provider network may include networking devices or appliances that provide network address translation (NAT) or similar functionality to perform the mapping from public IP addresses to local IP addresses and vice versa.

Public IP addresses are Internet mutable network addresses that are assigned to resource instances, either by the service provider or by the customer. Traffic routed to a public IP address is translated, for example via 1:1 NAT, and forwarded to the respective local IP address of a resource instance.

Some public IP addresses may be assigned by the provider network infrastructure to particular resource instances; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some embodiments, the mapping of a standard IP address to a local IP address of a resource instance is the default launch configuration for all resource instance types.

At least some public IP addresses may be allocated to or obtained by customers of the provider network 1200; a customer may then assign their allocated public IP addresses to particular resource instances allocated to the customer. These public IP addresses may be referred to as customer public IP addresses, or simply customer IP addresses. Instead of being assigned by the provider network 1200 to resource instances as in the case of standard IP addresses, customer IP addresses may be assigned to resource instances by the customers, for example via an API provided by the service provider. Unlike standard IP addresses, customer IP addresses are allocated to customer accounts and can be remapped to other resource instances by the respective customers as necessary or desired. A customer IP address is associated with a customer's account, not a particular resource instance, and the customer controls that IP address until the customer chooses to release it. Unlike conventional static IP addresses, customer IP addresses allow the customer to mask resource instance or availability zone failures by remapping the customer's public IP addresses to any resource instance associated with the customer's account. The customer IP addresses, for example, enable a customer to engineer around problems with the customer's resource instances or software by remapping customer IP addresses to replacement resource instances.

FIG. 13 is a block diagram of an example provider network that provides a storage service and a hardware virtualization service to customers, according to some embodiments. Hardware virtualization service 1320 provides multiple computation resources 1324 (e.g., VMs) to customers. The computation resources 1324 may, for example, be rented or leased to customers of the provider network 1300 (e.g., to a customer that implements customer network 1350). Each computation resource 1324 may be provided with one or more local IP addresses. Provider network 1300 may be configured to route packets from the local IP addresses of the computation resources 1324 to public Internet destinations, and from public Internet sources to the local IP addresses of computation resources 1324.

Provider network 1300 may provide a customer network 1350, for example coupled to intermediate network 1340 via local network 1356, the ability to implement virtual computing systems 1392 via hardware virtualization service 1320 coupled to intermediate network 1340 and to provider network 1300. In some embodiments, hardware virtualization service 1320 may provide one or more APIs 1302, for example a web services interface, via which a customer network 1350 may access functionality provided by the hardware virtualization service 1320, for example via a console 1394 (e.g., a web-based application, standalone application, mobile application, etc.). In some embodiments, at the provider network 1300, each virtual computing system 1392 at customer network 1350 may correspond to a computation resource 1324 that is leased, rented, or otherwise provided to customer network 1350.

From an instance of a virtual computing system 1392 and/or another customer device 1390 (e.g., via console 1394), the customer may access the functionality of storage service 1310, for example via one or more APIs 1302, to access data from and store data to storage resources 1318A-1318N of a virtual data store 1316 (e.g., a folder or “bucket”, a virtualized volume, a database, etc.) provided by the provider network 1300. In some embodiments, a virtualized data store gateway (not shown) may be provided at the customer network 1350 that may locally cache at least some data, for example frequently-accessed or critical data, and that may communicate with storage service 1310 via one or more communications channels to upload new or modified data from a local cache so that the primary store of data (virtualized data store 1316) is maintained. In some embodiments, a user, via a virtual computing system 1392 and/or on another customer device 1390, may mount and access virtual data store 1316 volumes via storage service 1310 acting as a storage virtualization service, and these volumes may appear to the user as local (virtualized) storage 1398.

While not shown in FIG. 13, the virtualization service(s) may also be accessed from resource instances within the provider network 1300 via API(s) 1302. For example, a customer, appliance service provider, or other entity may access a virtualization service from within a respective virtual network on the provider network 1300 via an API 1302 to request allocation of one or more resource instances within the virtual network or within another virtual network.

Illustrative System

In some embodiments, a system that implements a portion or all of the techniques for layout-agnostic complex document processing as described herein may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media, such as computer system 1400 illustrated in FIG. 14. In the illustrated embodiment, computer system 1400 includes one or more processors 1410 coupled to a system memory 1420 via an input/output (I/O) interface 1430. Computer system 1400 further includes a network interface 1440 coupled to I/O interface 1430. While FIG. 14 shows computer system 1400 as a single computing device, in various embodiments a computer system 1400 may include one computing device or any number of computing devices configured to work together as a single computer system 1400.

In various embodiments, computer system 1400 may be a uniprocessor system including one processor 1410, or a multiprocessor system including several processors 1410 (e.g., two, four, eight, or another suitable number). Processors 1410 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 1410 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, ARM, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 1410 may commonly, but not necessarily, implement the same ISA.

System memory 1420 may store instructions and data accessible by processor(s) 1410. In various embodiments, system memory 1420 may be implemented using any suitable memory technology, such as random-access memory (RAM), static RAM (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within system memory 1420 as code 1425 and data 1426.

In one embodiment, I/O interface 1430 may be configured to coordinate I/O traffic between processor 1410, system memory 1420, and any peripheral devices in the device, including network interface 1440 or other peripheral interfaces. In some embodiments, I/O interface 1430 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 1420) into a format suitable for use by another component (e.g., processor 1410). In some embodiments, I/O interface 1430 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 1430 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 1430, such as an interface to system memory 1420, may be incorporated directly into processor 1410.

Network interface 1440 may be configured to allow data to be exchanged between computer system 1400 and other devices 1460 attached to a network or networks 1450, such as other computer systems or devices as illustrated in FIG. 1, for example. In various embodiments, network interface 1440 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, network interface 1440 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks (SANs) such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.

In some embodiments, a computer system 1400 includes one or more offload cards 1470 (including one or more processors 1475, and possibly including the one or more network interfaces 1440) that are connected using an I/O interface 1430 (e.g., a bus implementing a version of the Peripheral Component Interconnect-Express (PCI-E) standard, or another interconnect such as a QuickPath interconnect (QPI) or UltraPath interconnect (UPI)). For example, in some embodiments the computer system 1400 may act as a host electronic device (e.g., operating as part of a hardware virtualization service) that hosts compute instances, and the one or more offload cards 1470 execute a virtualization manager that can manage compute instances that execute on the host electronic device. As an example, in some embodiments the offload card(s) 1470 can perform compute instance management operations such as pausing and/or un-pausing compute instances, launching and/or terminating compute instances, performing memory transfer/copying operations, etc. These management operations may, in some embodiments, be performed by the offload card(s) 1470 in coordination with a hypervisor (e.g., upon a request from a hypervisor) that is executed by the other processors 1410A-1410N of the computer system 1400. However, in some embodiments the virtualization manager implemented by the offload card(s) 1470 can accommodate requests from other entities (e.g., from compute instances themselves), and may not coordinate with (or service) any separate hypervisor.

In some embodiments, system memory 1420 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above. However, in other embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media. Generally speaking, a computer-accessible medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD coupled to computer system 1400 via I/O interface 1430. A non-transitory computer-accessible storage medium may also include any volatile or non-volatile media such as RAM (e.g., SDRAM, double data rate (DDR) SDRAM, SRAM, etc.), read only memory (ROM), etc., that may be included in some embodiments of computer system 1400 as system memory 1420 or another type of memory. Further, a computer-accessible medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 1440.

In the preceding description, various embodiments are described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.

Bracketed text and blocks with dashed borders (e.g., large dashes, small dashes, dot-dash, and dots) are used herein to illustrate optional operations that add additional features to some embodiments. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments.

Reference numerals with suffix letters (e.g., 1318A-1318N) may be used to indicate that there can be one or multiple instances of the referenced entity in various embodiments, and when there are multiple instances, each does not need to be identical but may instead share some general traits or act in common ways. Further, the particular suffixes used are not meant to imply that a particular amount of the entity exists unless specifically indicated to the contrary. Thus, two entities using the same or different suffix letters may or may not have the same number of instances in various embodiments.

References to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

Moreover, in the various embodiments described above, unless specifically noted otherwise, disjunctive language such as the phrase “at least one of A, B, or C” is intended to be understood to mean either A, B, or C, or any combination thereof (e.g., A, B, and/or C). As such, disjunctive language is not intended to, nor should it be understood to, imply that a given embodiment requires at least one of A, at least one of B, or at least one of C to each be present.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the disclosure as set forth in the claims.

What is claimed is:
 1. A computer-implemented method comprising: receiving, at a web service endpoint of a provider network, a first request to analyze a document encoded by a file, the request including the file or identifying a storage location of the file; generating a job identifier (ID) for a job; sending a first response to the first request that includes the job ID; running the job, comprising: generating a plurality of chunks from the file; for a first chunk of the plurality of chunks, identifying one or more segments within the chunk; determining, via use of one or more machine learning (ML) models, a classification of each of the one or more segments into one or more corresponding segment types; processing each of the one or more segments using an analysis engine selected based on the determined segment type for the segment to yield one or more results; and constructing a result for the job based at least in part on the one or more results; receiving, at the web service endpoint, a second request for the result for the job, the second request including the job ID; and transmitting a second response to the second request that includes the result or identifies a storage location of the result.
 2. The computer-implemented method of claim 1, wherein at least one of the one or more corresponding segment types comprises one of: form, receipt, invoice, paragraph, or table.
 3. The computer-implemented method of claim 2, wherein the at least one segment type is form, and wherein the result identifies a plurality of key-value pairings, wherein each key-value pairing indicates a key detected in the form and a value detected in the form that corresponds to the key.
 4. A computer-implemented method comprising: receiving a request to analyze a document encoded by a file; identifying, within at least a portion of the electronic document, one or more segments; determining, via use of one or more machine learning (ML) models, a classification of each of the one or more segments into one or more corresponding segment types; processing each of the one or more segments using an analysis engine selected based on the determined segment type for the segment to yield one or more results; and transmitting a response including an output based on the one or more results.
 5. The computer-implemented method of claim 4, wherein the request includes data of the file or identifies a location of the file.
 6. The computer-implemented method of claim 4, further comprising: generating a job identifier (ID) of a job; transmitting the job ID within a second response, wherein the second response is associated with the request; and receiving a second request for the output of the job, the second request comprising the job ID, wherein the response is associated with the second request.
 7. The computer-implemented method of claim 6, further comprising: splitting the file into a plurality of chunks; analyzing, at least partially in a parallel manner, the plurality of chunks to yield a plurality of chunk results, wherein the portion is from a first chunk of the plurality of chunks; determining that the plurality of chunks have all been analyzed; and combining the plurality of chunk results into the output value.
 8. The computer-implemented method of claim 6, further comprising: determining that a value meets or exceeds a threshold, the value comprising a size of the file, a number of pages of the file, or an amount of requests pending for a same user; determining that an amount of idle compute capacity does not meet or exceed an availability threshold; and suspending processing of the file for a period of time.
 9. The computer-implemented method of claim 4, wherein: the response comprises a HyperText Transfer Protocol (HTTP) response message sent in response to the request, the request comprising an HTTP request message; the HTTP request message was received at a first Application Programming Interface (API) endpoint; and the method further comprises: sending another HTTP request message to another endpoint associated with an analysis service, the another HTTP request message including data of the file or identifying a location of the file, and receiving another HTTP response message from the another endpoint.
 10. The computer-implemented method of claim 4, wherein at least one of the one or more corresponding segment types comprises one of: form, receipt, invoice, paragraph, or table.
 11. The computer-implemented method of claim 10, wherein: at least one of the one or more corresponding segment types is form; and the processing of at least one of the one or more segments using the analysis engine comprises: identifying a plurality of keys of the form and a plurality of values of the form; and determining a plurality of key-value pairings, wherein at least one of the plurality of key-value pairings comprises one of the plurality of keys and one of the plurality of values that corresponds to the one key.
 12. The computer-implemented method of claim 11, wherein: identifying the plurality of keys and the plurality of values is based on use of a second one or more ML models; and determining the plurality of key-value pairings is based on use of a third one or more ML models.
 13. The computer-implemented method of claim 4, wherein the request is received at a web service endpoint of a provider network, and wherein the response is destined to a client device located outside of the provider network.
 14. The computer-implemented method of claim 4, wherein the file comprises an electronic image taken of the document.
 15. A system comprising: a storage service implemented by a first one or more electronic devices within a provider network; and a document processing service implemented by a second one or more electronic devices within the provider network, the document processing service including instructions that upon execution cause the document processing service to: receive a request to analyze a document encoded by a file, the request identifying a location provided by the storage service that stores the file; obtain the file from the storage service; identify, within at least a portion of the electronic document, one or more segments; determine, via use of one or more machine learning (ML) models, a classification of each of the one or more segments into one or more corresponding segment types; process each of the one or more segments using an analysis engine selected based on the determined segment type for the segment to yield one or more results; and transmit a response including an output based on the one or more results.
 16. The system of claim 15, wherein the request includes data of the file or identifies a location of the file.
 17. The system of claim 15, wherein the instructions further cause the document processing service to: generate a job identifier (ID) of a job; transmit the job ID within a second response, wherein the second response is associated with the request; and receive a second request for the output of the job, the second request comprising the job ID, wherein the response is associated with the second request.
 18. The system of claim 17, wherein the instructions further cause the document processing service to: split the file into a plurality of chunks; analyze, at least partially in a parallel manner, the plurality of chunks to yield a plurality of chunk results, wherein the portion is from a first chunk of the plurality of chunks; determine that the plurality of chunks have all been analyzed; and combine the plurality of chunk results into the output value.
 19. The system of claim 17, wherein the instructions further cause the document processing service to: determine that a value meets or exceeds a threshold, the value comprising a size of the file, a number of pages of the file, or an amount of requests pending for a same user; determine that an amount of idle compute capacity does not meet or exceed an availability threshold; and suspend processing of the file for a period of time.
 20. The system of claim 15, wherein the request is carried by a HyperText Transfer Protocol (HTTP) request message received at a web service endpoint of the provider network, and wherein the response is carried by an HTTP response message destined to a client device located outside of the provider network.